Take-home Exercise 2

1. Overview

The task Take-home Exercise 2 is to select one of the Take-home Exercise 1 prepared by our classmate, critic the submission in terms of clarity and aesthetics,prepare a sketch for the alternative design by using the data visualisation design principles and best practices we had learned in Lesson 1 and 2, and remake the original design by using ggplot2, ggplot2 extensions and tidyverse packages.

2. Installing and loading R packages

Two packages will be installed and loaded. They are tidyverse and ggiraph.

pacman::p_load(ggiraph, tidyverse, ggplot2,dplyr,funModeling,ggpubr, plotly )

3. Importing data

population <- read_csv("data/respopagesextod2022.csv")
Rows: 100928 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): PA, SZ, AG, Sex, TOD
dbl (2): Pop, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4. Visualization Critique

4.1 Original design

The following is the age-sex pyramid provided by my classmate in this take-home exercise 1 and it will be reviewed and remade in terms of clarity and aesthetics.

4.2 Clarity

4.2.1 Graphical Integrity: Show Me the Truth

The better way for define the age group segmentation should be 0-4, 5-9, etc, and not “children”,“early working age”, etc. The reason is “children”,“early working age”, etc is too general, it will mask out age segmentatio signal.

4.2.2 Visualising the Right Data

Absolute values will reveal more interesting patterns than the derived values. From the derived values which is population in percentage, we will not know what is the number of female and male population difference from the 9 most populated planning areas.

4.2.3 Reference line

In the remake, the reference line is added which is the avg population (sum of total population divided by 9 planning areas/19 age groups/2 genders).

4.2.4 interactive function

Add on interactive function into the chart to show the population numbers when move cursor.

4.3 Aesthetics

4.3.1 Application of pre-attentive principle

Colours. The original age-sex pyramid used one color for two genders. However, this display does not show a clear comparison betewwn male and female population. It does not help user to distinguish the genders from the 1st attention. . It’s suggested to choose two different colors to indicate the two different genders in this visualization.

4.3.2 X-Axis Title & Labels

The population number is large. It’s good to show number in the unit of thousand With the aid of the newly added x-axis title and labels, it makes it easier for users to read and interpret this visualization. The X-axis labels will be converted to positive values because population should be positive values.

4.3.3 Caption

The caption to mention the data source for this visualization has also been included at the bottom in a clean manner.

5. Visualization Remake step by step

Calculate total population count by planning area

pop_pa <- population %>%
  select(PA,AG,Sex,Pop) %>%
  group_by(PA) %>%
  summarise(totalPop = sum(Pop))

Sort planning areas by population

pop_sorted <- pop_pa[order(pop_pa$totalPop, decreasing = TRUE), ]

Select top 9 planning areas by population

pop_filtered <- head(pop_sorted, 9)

Filter the raw dataset according to the top 9 planning areas by population

pop_pa_filtered <- population %>%
  filter(PA %in% pop_filtered$PA) 

aggregate data by planning areas, age group and sex

Pop_pa_age_sex <- aggregate(Pop ~ PA + AG + Sex, data = pop_pa_filtered, FUN = sum)

Sort dataset by top 9 planning areas and age group

verify the age groups

freq(data=Pop_pa_age_sex , 
     input = 'AG')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

            AG frequency percentage cumulative_perc
1       0_to_4        18       5.26            5.26
2     10_to_14        18       5.26           10.52
3     15_to_19        18       5.26           15.78
4     20_to_24        18       5.26           21.04
5     25_to_29        18       5.26           26.30
6     30_to_34        18       5.26           31.56
7     35_to_39        18       5.26           36.82
8     40_to_44        18       5.26           42.08
9     45_to_49        18       5.26           47.34
10      5_to_9        18       5.26           52.60
11    50_to_54        18       5.26           57.86
12    55_to_59        18       5.26           63.12
13    60_to_64        18       5.26           68.38
14    65_to_69        18       5.26           73.64
15    70_to_74        18       5.26           78.90
16    75_to_79        18       5.26           84.16
17    80_to_84        18       5.26           89.42
18    85_to_89        18       5.26           94.68
19 90_and_over        18       5.26          100.00

order the age groups

Pop_pa_age_sex$AG[which(Pop_pa_age_sex$AG=="0_to_4")] <-"00_to_04"
Pop_pa_age_sex$AG[which(Pop_pa_age_sex$AG=="5_to_9")] <-"05_to_09"

Derived the population in thousand and avg population in thousand

pop_final <- Pop_pa_age_sex %>%

  mutate(Pop_2=round(Pop/1000,2))%>%
  mutate(Avg=round(sum(Pop)/1000/9/19/2,2))%>%
  arrange(PA, AG)

Create the age_sex_pyramid

age_sex_pyramid <- ggplot(data=pop_final,aes(x=AG,fill=Sex)) +
  theme_bw() + ## change background color to white
  geom_bar(data=subset(pop_final,Sex=="Females"),stat='identity',aes(y=Pop_2)) +
  geom_bar(data=subset(pop_final,Sex=="Males"),stat='identity',aes(y=Pop_2*(-1)))  + 
  scale_y_continuous(breaks=seq(-20,20,5),labels=abs(seq(-20,20,5))) +
  facet_wrap(~ PA)+
  coord_flip()+
  theme_bw() +
  scale_fill_manual(values = c("Males" = "blue", 
                               "Females" = "red")) +


  labs(x = "Age Group", 
       y = "Population(in thousand)",
       title = "Singapore Population Pyramid by age&sex from top 9 Planning Areas in June 2022",  
       subtitle = "Top 9 Planning Areas by Population, 2022",
       caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data")+

    theme(plot.title = element_text(hjust=0.5, size=14),
          plot.subtitle = element_text(hjust = 0.5,size = 8),
          legend.title = element_text(size=10),
          legend.text = element_text(size=8),
          axis.text = element_text(face="bold"),
          axis.ticks.x=element_blank(),
          axis.text.x = element_text(angle = 0),
          axis.title.y=element_text(angle=0))

age_sex_pyramid_final <-age_sex_pyramid + geom_hline(yintercept = pop_final$Avg,linetype="dotted", color = "black") + geom_hline(yintercept = -(pop_final$Avg),linetype="dotted", color = "black")+

  geom_text(aes(0,Avg,label = 'Avg', vjust = -1))+
  geom_text(aes(0,-Avg,label = 'Avg', vjust = -1)) 

ggplotly(age_sex_pyramid_final)